ABSTRACT

One of the most difficult issues that must be addressed when studying a class of
parallel algorithms is the problem of choosing a model that captures the inherent
difficulty of implementing these algorithms on a multiprocessor architecture. Shared
memory models have proven to be an effective tool for deriving lower bounds on the
complexity of comparison problems. In particular, a speed-up of lg(P) is possible for the
problem of finding an element in an N-element sorted list, and speed-ups of P/lglgP and
P are possible for merging N-element sorted lists on P processors for the cases of N = P
and P < N respectively.

In practice, these speed-ups are not attainable since the shared memory models
ignore many practical considerations in multiprocessor systems, such as interprocessor
communications, distribution of data on local memories and limited fan-out of memory
locations. In this paper we introduce a model for parallel computation that is strictly
weaker than the shared memory models. The model is based on an actual machine
currently being constructed (ZMOB). We examine the communication facilities available
in the model and show that lower bounds for merging and searching on shared memory
models are attainable (within a constant). The main results reported in the paper are:

- an O(lgN/lgP) algorithm for searching an N-element sorted list distributed
on P processors.

- an O(N/P) algorithm for merging two N-element lists on 2P processors.

- an O(lgN) algorithm for merging two N-element lists on 2N processors.

- criteria and techniques for simulating CREW PRAM algorithms on ZMOB.
One of the techniques is used to establish an O(lglgN) lower bound
for merging two N-element lists on 2N processors.

Research  sponsored  by  the  Air  Force  Office  of  Scientific  Research  (AFSC),  under  Contract 
F49620-83-C-0082. The United States Government is authorized to reproduce and distribute
reprints for government purposes notwithstanding any copyright notation hereon.

1. Introduction


During the last decade a significant amount of progress
has been made towards the understanding of the value
of parallelism for specific computational problems.
Sorting, searching and merging are three of the most
fundamental tasks in computer science. Their significance is due
to the key role they play in many domains of application.
With the emergence of VLSI technology it is inevitable for
us to wonder how fast these tasks may be performed on a
parallel (multiprocessor) machine. However, in contrast to
Von Neumann machines, whose execution is well understood
and for which we have a well established theory of the
time complexity of comparison problems, this is not the
case for parallel machines. The problem is due to the
difficulty in correctly modeling the execution of a
physically realizable parallel computer. Thus, one of the most
difficult issues that must be addressed when studying a
class of parallel algorithms is the problem of choosing a
model that captures the inherent difficulty of implementing
these algorithms on a multiprocessor architecture.

Many models have been proposed to solve this problem.
Roughly speaking, parallel models of computation belong
to two categories: shared memory models and fixed
interconnection models. A typical shared memory model allows
many processors to read the same location simultaneously,
but disallows concurrent writes to the same location.
Since the model allows concurrent reads and insists on
exclusive writes it is known in the literature as the CREW
PRAM. Examples of fixed interconnection networks are the
shuffle exchange network, mesh-connected array and
n-dimensional hypercube.

Shared memory models are currently not realizable in
practice. However, they serve as powerful analysis tools
to derive lower bounds for parallel computers. That is, if
we can show that the worst case time complexity of an
algorithm is O(N) for the CREW PRAM, we have also
established that no fixed interconnection network based
parallel machine can perform faster (see [1] for a spectrum of
models and their relationships). Establishing lower
bounds for searching, merging and sorting was indeed the
motivation for the powerful comparison model introduced by
Valiant. In [8] several optimal algorithms for comparison
problems are presented. The algorithm for merging was
later shown to be implementable on the CREW PRAM [1],[3].

This paper is a further step in this direction. We
introduce a model for parallel computation, called ZMOB.
The ZMOB model is shown to be strictly weaker than the
CREW PRAM. Despite this fact we demonstrate that within
the constraints of this model we are still able to achieve
(up to a constant) the lower bounds for searching and
merging that are attainable on a more powerful model of
computation. Additionally, we define two fundamental
criteria that, if obeyed, allow us to simulate any CREW PRAM
algorithm on the model investigated in this paper.

The  outline  of  the  paper  is  as  follows: 

In Section 2 we describe the parallel model of
computation used in this paper, and establish its relationship
to the CREW PRAM.

In Section 3 we investigate the problem of searching
an N-element sorted list. We define the criteria for
optimality of distribution of elements to processors,
and give an allocation function of elements to processors
that allows us to search the list in O(lgN/lgP) time. This
algorithm is shown to be optimal. Since no communication
is needed after the element searched for is broadcast to
all the processors, the result in Section 3 is not
restricted to ZMOB.

In  Section  4  we  present  three  algorithms  for  merging 
two  N-element  strings  on  a  P-processor  ZMOB.  Two  of  the 
algorithms  are  shown  to  be  optimal  up  to  a  constant. 


Finally, in Section 5 we conclude with some thoughts
on extensions of the research reported in this paper, and
with a discussion of the cost effectiveness of the model
described in Section 2.

2.  A  Model  for  Parallel  Computation 

The model described in this section is based on ZMOB,
a parallel multi-microprocessor system under development
at the University of Maryland ([4]). ZMOB is to consist of
256 Z80A microprocessors connected to a host computer
(VAX-11/780). Communication between machines is via a high
speed, 48 bit wide, 257 stage shift register called the
"conveyor belt". Each processor is connected to the
conveyor belt via a collection of high speed 8-bit I/O
registers, called its "mail stop". The registers are in
charge of interrupt control, buffering and address control
functions. The system is described in detail in [4,6].
We shall briefly describe here only the communication
features necessary to understand the material in the
sections that follow.

2.1.  ZMOB  communication  facilities 

As mentioned above the processors communicate by
sending messages to each other on the "conveyor belt".
Each processor sends information using its own uniquely
determined location on the belt, called its bin. Each
processor may read information from any bin, including its
own, depending on the control bits set in the message
contained in the bin. That is, each processor can
theoretically consume any bin that is currently at its mail stop,
but it can send information out only in its own bin. The
control bits in the message allow the implementation of
several communication strategies, as explained below. The
message contained in each bin may be described by a
4-tuple (C X S D), where C, X, S, D correspond to control,
message content, source address, and destination address,
respectively. Different control bits specify the
following communication formats:

COMM-1.

Direct addressing - The message X is sent to a
processor whose physical address is D.

COMM-2.

Pattern matching - Message X is sent to the first
processor whose pattern (determined by Capability
Code and Mask Registers in the Mail Stop) matches D.

COMM-3.

Send to all processors - Message X is sent to all
processors.

COMM-4.

Send to a set of processors - Message X is sent to
all processors whose patterns match D.

Additionally, different settings of Control Registers in
the Mail Stop allow the following:

COMM-5.

Exclusive source - This mode provides exclusive
conversation between two processors and disables
interrupts from other processors.

COMM-6.

Readback - This mode allows an individual processor
to intercept any of its own messages that has gone
around the conveyor belt and was not consumed by any
destination processor.
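
The four addressing formats COMM-1 to COMM-4 can be sketched as a small delivery function. This is only an illustration under stated assumptions: the class name `Message`, the function `recipients`, and the control labels are hypothetical, pattern matching is reduced to plain equality (the real Mail Stop uses Capability Code and Mask Registers), and a broadcast is assumed not to be delivered back to its sender.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """The 4-tuple (C X S D) carried by a bin (field names are hypothetical)."""
    control: str     # C: "DIRECT", "FIRST_MATCH", "ALL" or "SET" (illustrative labels)
    content: object  # X: message content
    source: int      # S: address of the sending processor
    dest: object     # D: physical address or pattern


def recipients(msg, patterns):
    """Return the addresses that consume msg, in belt order.

    patterns[p] is the receiving pattern of processor p. Matching is
    simplified to equality, which is an assumption of this sketch.
    """
    n = len(patterns)
    # The bin passes the other processors in ring order after leaving its owner.
    order = [(msg.source + k) % n for k in range(1, n)]
    if msg.control == "DIRECT":        # COMM-1: physical address D
        return [msg.dest]
    if msg.control == "FIRST_MATCH":   # COMM-2: first processor whose pattern matches D
        for p in order:
            if patterns[p] == msg.dest:
                return [p]
        return []
    if msg.control == "ALL":           # COMM-3: every other processor
        return order
    if msg.control == "SET":           # COMM-4: all processors whose pattern matches D
        return [p for p in order if patterns[p] == msg.dest]
    raise ValueError("unknown control value: %r" % (msg.control,))
```

For example, with patterns `["a", "b", "a", "b"]`, a COMM-4 style message sent by processor 1 with destination pattern `"a"` reaches processors 2 and 0, while a COMM-2 style message from processor 0 with pattern `"b"` is consumed by processor 1 only.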

Though in principle ZMOB is an asynchronous machine,
for the purpose of this paper we shall assume synchronous
operation. That is, we assume that a unit time is a
complete revolution of the belt, starting from the point when
every bin resides at the mail stop of the processor that
owns that bin. The unit time ends when each processor has
had a chance to read the message that was sent to it. As
we shall see shortly, at each communication step only one
message may be sent to any processor. Moreover, at each
communication step each processor may send out only one
message.

2.2. ZMOB as a model for parallel computation

Formalizing the above concepts, at each execution
step ZMOB may be modeled as a directed graph. The nodes in
the graph correspond to the processors connected on the
belt, and the arcs correspond to the communication links
among the processors. Each processor is assumed to have
internal memory. The hardware configuration allows for a
processor to communicate with all the processors connected
to it in one revolution of the belt. The belt is assumed
to be so fast relative to the processors that each
processor can communicate with the processors connected to it in
unit time. The interconnection topology at each step is
determined as follows:

1. If PE_i sends a message by physical address then
depending on whether it was sent by COMM-1 or COMM-3 it
is assumed to be connected to one or all processors
in the graph.

2. If PE_i sends a message by pattern α to a set of
processors then it is assumed to be connected to all the
processors that have α as their receiving pattern.

3. Each processor may have multiple outgoing arcs.
However, it may output only one message at a time.

4. Each processor may have multiple incoming arcs.
However, it may receive only one message at a time. To
prevent confusion all the algorithms presented in the
paper assume that a processor may have at most one
incoming arc.

5. A processor may not have incoming and outgoing arcs
at the same time. That is, at each revolution of the
belt the processor is either receiving or sending
information, but not both.

To simplify matters we shall assume that each
execution step of ZMOB consists of two phases: a communication
step and an execution step. The communication step is
further subdivided into a sending step, when all the
processors load their respective bins, and a receiving step,
when all the processors consume the messages sent to them
during the sending step. The execution step is assumed to
be one computational step of the processor, e.g.,
comparison and addition of two integers. The computational
step is assumed to be at least as long as the
communication step.

Conditions 1-3 make the model more powerful than a
simple communication ring. Conditions 4-5 distinguish the
model from shared memory models such as the CREW PRAM. To
see that ZMOB is indeed weaker than a model that allows P
processors to read simultaneously from any location, one
needs to envision a CREW PRAM algorithm that in one atomic
step performs P reads from P locations, stored in an
N-location shared memory. To simulate the above situation
on ZMOB, we need to distribute the N locations on P
processors, by storing N/P locations in the internal memory
of each processor. Now, in the worst case, all the
simultaneously read locations may reside in the same PE.
Consequently, if P processors are to read from some processor
PE_i, we have P outgoing arcs from PE_i. However, by
condition 3, PE_i can output only one element at a time. Thus, a
P-processor ZMOB can simulate an arbitrary P-processor
CREW PRAM in O(P) time, and the bound is tight.
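
The worst-case argument above can be made concrete with a small counting sketch. It assumes (these are assumptions of the sketch, not statements from the paper) that the N locations are stored in contiguous blocks of N/P per processor, that N is divisible by P, and that a processor can serve all readers of one location with a single selective broadcast, so only distinct locations held by the same processor cost extra revolutions. The function name `revolutions_needed` is hypothetical, and the restriction that a processor cannot send and receive in the same revolution is ignored.

```python
from collections import defaultdict


def revolutions_needed(requests, n_locations, n_procs):
    """Belt revolutions needed to serve one round of concurrent reads.

    requests[r] is the shared-memory location read by reader r.
    Locations are stored in contiguous blocks of n_locations // n_procs
    per processor (block allocation is an assumption of this sketch).
    """
    block = n_locations // n_procs
    per_owner = defaultdict(set)
    for loc in requests:
        per_owner[loc // block].add(loc)  # owner of loc under block allocation
    # Each owner outputs one message per revolution, so the busiest owner
    # determines the number of revolutions.
    return max((len(locs) for locs in per_owner.values()), default=0)


# Worst case: P = 4 readers ask for 4 distinct locations of the same PE.
print(revolutions_needed([0, 1, 2, 3], 16, 4))    # 4 revolutions
# Best case: every reader asks a different PE.
print(revolutions_needed([0, 4, 8, 12], 16, 4))   # 1 revolution
```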

In  the  following  sections  we  shall  show  several 
nontrivial  adaptations  of  CREW  PRAM  algorithms  for  ZMOB, 
sacrificing  only  a  constant  factor. 


3.  Searching 


In this section we consider the problem of searching
for an element in an N-element sorted list, distributed
on P processors. The first thought that comes to mind is:
distribute the N elements evenly among the P processors
and perform a sequential binary search on each processor.
However, this intuitively appealing approach results in
negligible speed-up since each processor performs
lg(N/P) = lgN - lgP comparisons.

3.1. Algorithm SEARCH

We  first  present  an  algorithm  for  searching  that  may 
be  implemented  on  a  shared  memory  computer  capable  of  per¬ 
forming  concurrent  reads. 

Let x = (x_1, ..., x_N) be a sorted list of N elements. For
simplicity assume N = s^{n+1} - 1, where s = P+1. Algorithm
SEARCH works as follows:

0. Let k = n.

1. Mark the elements of x subscripted by i s^k, 1 ≤ i ≤ P.

2. Assign PE_i to element i s^k and compare it to the
element being searched for. At the end of this step PE_i
records the result of the comparison in location
loc_i.

3. Now each PE_i compares loc_i and loc_{i-1}, for 1 < i ≤ P.
If for some j loc_j ≠ loc_{j-1}, then the searched
element is in the j-th interval.

4. Let k = k-1 and reindex the elements of the j-th
interval by 1, ..., s^{k+1} - 1.

5. Repeat steps 1-5 for the elements in the j-th
interval.

Since P comparisons are performed simultaneously in
steps 2-3 we reduce the problem of searching s^{k+1} - 1 elements
to a problem of searching s^k - 1 elements. Thus, in Cn
steps we can search the entire list (for some constant C).
Consequently the time complexity of the algorithm SEARCH
is of order

    n + 1 = lg(N+1) / lg(P+1).

Referring to [2] we note that the algorithm is optimal.
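
A sequential simulation may help to see why n + 1 rounds suffice. The sketch below is not the paper's code: the function name `parallel_search` is hypothetical, indices are 0-based, and the P simultaneous comparisons of one round are simply executed in a loop. It assumes len(x) = (P+1)^{n+1} - 1, as in the text.

```python
def parallel_search(x, target, P):
    """Simulate algorithm SEARCH on a sorted list x with P processors.

    Assumes len(x) == (P + 1) ** (n + 1) - 1 for some n >= 0.
    Returns (index of target or None, number of parallel rounds used).
    """
    s = P + 1
    lo, hi = 0, len(x)              # current interval is x[lo:hi]
    rounds = 0
    while hi > lo:
        step = (hi - lo + 1) // s   # s**k for the current value of k
        rounds += 1
        # PE_i probes the element whose 1-based local index is i * s**k.
        probes = [lo + i * step - 1 for i in range(1, P + 1)]
        loc = [x[p] <= target for p in probes]   # loc_i of step 2
        for p in probes:
            if x[p] == target:
                return p, rounds
        # loc is True up to some position and False afterwards; the place
        # where neighbouring values differ identifies the next interval.
        j = sum(loc)
        new_lo = lo if j == 0 else probes[j - 1] + 1
        new_hi = hi if j == P else probes[j]
        lo, hi = new_lo, new_hi
    return None, rounds


x = list(range(10, 90, 10))          # N = 8 = 3**2 - 1, so P = 2 and n = 1
print(parallel_search(x, 50, 2))     # found at index 4 after n + 1 = 2 rounds
```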


Note that step 3 in SEARCH may be performed on the
ZMOB model introduced in Section 2 in one communication
step, where PE_i sends loc_i to PE_{i+1}. Unfortunately, step
2 in the algorithm does not readily apply on the ZMOB
model if we distribute the elements of x by assigning the
first N/P elements to the first PE, the following N/P elements
to the second, and so on. To see this, one needs to
observe that after the first comparison all the elements
that need to be compared in the next step reside in the
same PE. Since the model allows us to output only one
location at a time we have to multiply the complexity of
the algorithm by P, yielding P lg(N+1)/lg(P+1). Obviously, one can
allocate all N elements of the list x to every PE; but
this is hardly an optimal solution.


The analysis presented above suggests the question
of whether there is an allocation of elements to
processors that has the following desirable property:

Definition: Given a parallel algorithm A, an allocation
of elements to the processors is said to be a good
allocation for A if whenever PE_i performs an operation on some
element, that element already resides in the memory of PE_i.

Lemma 3.1:

Let x = (x_1, x_2, ..., x_N) be a sorted string. We shall
assume N = s^{n+1} - 1, where s = P+1. Let s_j be the
representation of j, 1 ≤ j ≤ N, using the base s, i.e.,

    s_j = a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + a_{j_1} s^1 + a_{j_0}.

Then, in algorithm SEARCH, if PE_i performs a comparison on
the element x_j at step k then

    s_j = a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + a_{j_{n-k+1}} s^{n-k+1}

and a_{j_{n-k+1}} = i.

Proof:

We shall prove the lemma using induction on the
number of steps k.

For k = 1 the elements being compared are indexed by i s^n,
1 ≤ i ≤ P, and PE_i compares element i s^n. Thus, if x_j is being
compared by PE_i, then s_j = i s^n, and a_{j_n} = i.

For 1 < k, assume PE_i compares x_j. By the induction
hypothesis the elements compared at step k-1 were indexed
by integers of the form

    a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + r s^{n-(k-1)+1}

for 1 ≤ r ≤ s-1. Assuming that the next searched interval is
given by

    ( a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + (r_0 - 1) s^{n-k+2},
      a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + r_0 s^{n-k+2} )

we find that at step k PE_i compares the element
subscripted by

    a_{j_n} s^n + a_{j_{n-1}} s^{n-1} + ... + (r_0 - 1) s^{n-k+2} + i s^{n-k+1}.

Q.E.D.


As  an  immediate  corollary  of  the  lemma  we  have 


Theorem 3.2:

Let x, s, N be as above. Let c_j be the coefficient of
the term with the smallest exponent in s_j, i.e., the
representation of j in the base s. Let f(j) be the
allocation function of elements of x to processors PE_i, 1 ≤ i ≤ P,
defined by

    f(j) = c_j.

Then f(j) is a good allocation function for
algorithm SEARCH.

Proof:

The proof follows immediately from Lemma 3.1 by
observing that whenever PE_i compares element j it must be
the case that

    c_j = a_{j_{n-k+1}} = i.

Q.E.D.
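
The allocation function can be checked mechanically for small cases. The sketch below reads c_j as the lowest-order nonzero digit of j in base s (an interpretation of "the term with the smallest exponent", taken here as an assumption), and walks over every interval that SEARCH could reach, verifying that the element probed by PE_i is always allocated to PE_i. The names `f` and `check_good_allocation` are illustrative.

```python
def f(j, s):
    """Lowest-order nonzero digit of j >= 1 written in base s."""
    while j % s == 0:
        j //= s
    return j % s


def check_good_allocation(P, n):
    """Verify that f is a good allocation for SEARCH with N = (P+1)**(n+1) - 1."""
    s = P + 1

    def explore(base, k):
        # The current interval holds the elements base+1 .. base + s**(k+1) - 1,
        # where base is a multiple of s**(k+1).
        for i in range(1, P + 1):
            if f(base + i * s ** k, s) != i:   # element probed by PE_i
                return False
        if k == 0:
            return True
        # Any of the s sub-intervals may be the one searched next.
        return all(explore(base + r * s ** k, k - 1) for r in range(s))

    return explore(0, n)


print(check_good_allocation(2, 1))   # True for P = 2, N = 8
print(check_good_allocation(3, 2))   # True for P = 3, N = 63
```

Under this reading every processor also receives the same number of elements: for s = 3 and N = 8, elements 1, 3, 4, 7 go to PE_1 and elements 2, 5, 6, 8 go to PE_2.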


4.  Merging  on  ZMOB 

In this section we address the problem of merging two
sorted strings of N numbers using P processors in a
multiprocessor like ZMOB. We present three different
algorithms for merging with the following characteristics:

1. All have a sublinear lower bound.

2. All are based on enumeration - that is, at the end of
the merging every element knows its absolute location
in the string of length 2N obtained by merging the
two input strings of length N.
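
The enumeration idea in item 2 can be illustrated sequentially: the absolute location of an element is its index in its own string plus the number of elements of the other string that precede it. The sketch below assumes all elements are distinct and uses 0-based positions; the function name `merge_by_enumeration` is hypothetical, and the binary search stands in for whatever parallel ranking procedure is actually used.

```python
from bisect import bisect_left


def merge_by_enumeration(x, y):
    """Merge two sorted lists of distinct elements by computing, for every
    element, its final 0-based position in the merged list."""
    out = [None] * (len(x) + len(y))
    for i, v in enumerate(x):
        out[i + bisect_left(y, v)] = v   # own index + rank of v in y
    for j, v in enumerate(y):
        out[j + bisect_left(x, v)] = v   # own index + rank of v in x
    return out


print(merge_by_enumeration([1, 4, 9], [2, 3, 10]))   # [1, 2, 3, 4, 9, 10]
```

Because every element computes its position independently, no element has to be moved until all positions are known, which is what makes the approach attractive on a machine where each processor holds a single element.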

The algorithm for merging presented in Section 4.1
has a lower bound of lg N. It fully utilizes the pattern
matching capabilities of ZMOB. The algorithm presented in
Section 4.2 for merging two N-element lists on 2N
processors is a nontrivial adaptation of Kruskal's merging
algorithm [3] for CREW PRAM to ZMOB. This algorithm is
optimal (up to a constant).

The two algorithms in Sections 4.1 - 4.2 do not
generalize to the case of merging N-element lists on P
processors for P < N. Therefore, in Section 4.3 we present
an optimal algorithm for merging two lists of length N
each on 2P processors for the case P < N. The time
complexity of this algorithm is of order N/P. The optimality
of the algorithm in Section 4.3 is based on the
observation that every parallel algorithm of time complexity T
may be converted to a sequential algorithm of complexity
PT and hence every parallel merging algorithm of order
O(N/P) is optimal.

4.1.  Merging  Using  Selective  Broadcasting 

In this section we present an algorithm for merging
two N-element sorted strings of integers on a 2N-processor
ZMOB. We present a fairly detailed algorithm in order to
provide the user with intuition as to how a machine like
ZMOB may be programmed.

Let x = (x_1, ..., x_N) and y = (y_1, ..., y_N) be two sorted
lists of integers. The first string is stored in
processors P_1, ..., P_N, subsequently referred to as the
x-processors, and the second string is stored in processors
P_{N+1}, ..., P_{2N}, referred to as the y-processors. The
elements are stored in ascending order, one element per
processor. To simplify the discussion we shall assume
without loss of generality that N = 2^n - 1, and that the end
elements of the string y, y_1 and y_N, are -infinity and
+infinity respectively. Each processor has the following
variables:

1. N[i] = the integer stored in PE_i.

2. INDEX[i] = the position of N[i] in the string x or y
depending on whether PE_i is an x-processor or a
y-processor.

3. PATTERN[i] = the pattern to be used by PE_i during the
selective broadcast operation.

4. RES[i] = the result of the comparison performed by
PE_i.

5. ENUM[i] = the final position of N[i] in an ordered
enumeration of the strings x and y.

6. TEMP[i] = the location of element x_i in the string
y. By location we mean the index j such that
y_j < x_i < y_{j+1}.

For simplicity algorithm MERGE_4.1 is subdivided
into two phases. During the first phase we call the
procedure FIND_PART_ORDER, with the strings x and y, that
for each element x_i finds its position in the list y. The
positions are stored in TEMP[i]. In the second phase the
procedure FIND_TOTAL_ORDER is called to find the absolute
location of each element in the resulting string.

Procedure FIND_PART_ORDER is called with parameters
(lx,ux) and (ly,uy) which correspond to the lower and
upper bounds of the two sorted strings x and y to be
merged. For simplicity we assume that the elements in the
strings are distinct. Initially we call FIND_PART_ORDER
with the parameters 1, N, N+1 and 2N. Procedure
FIND_PART_ORDER is given below.


0. For all j such that ly ≤ j ≤ uy set
PATTERN[j] = i.

1. Broadcast N[i] to all processors PE_j such that
ly ≤ j ≤ uy.

2. For all j such that ly ≤ j ≤ uy let PE_j compare N[j] to
N[i] and store the result of the comparison in RES[j].

3. By comparing RES[j] with RES[j+1], for all j such
that ly ≤ j < uy, find the location of N[i] in the string y
and broadcast it to PE_i.

4. Let TEMP[i] = the location of N[i] in y.

Observe that ly-1 ≤ TEMP[i] ≤ uy.


if lx < i then


Otherwise, ( l1y < u1y ) call

FIND_PART_ORDER( l1x, u1x, l1y, u1y )

9. If l2y > u2y then

For all i, l2x ≤ i ≤ u2x, set TEMP[i] to u2y

If l2y = u2y then

begin

FIND_PART_ORDER( l2y, u2y, l2x, u2x )

For all i, l2x ≤ i ≤ u2x, such that i < TEMP[l2y] set
TEMP[i] = l2y-1

For all i, l2x ≤ i ≤ u2x, such that i > TEMP[l2y] set
TEMP[i] = l2y
end

Otherwise, ( l2y < u2y ) call

FIND_PART_ORDER( l2x, u2x, l2y, u2y )

coend steps 8-9.
end

end

Explanation:

Intuitively, procedure FIND_PART_ORDER works as
follows:

At steps 0-1 the middle element of x, x_i, is
broadcast to all the y-processors. This defines two segments of
x, X_1 and X_2. In step 6 we determine the lower and the
upper bounds of each x-segment. Since we conveniently
have chosen N to be 2^n - 1, the length of each x-segment is
2^{n-1} - 1.

At step 3 we find the location of the element in y.
This defines two segments of y, Y_1 and Y_2. In step 7 we
determine the lower and the upper bounds of each
y-segment.

Clearly we can now separately merge X_1 with Y_1 and
X_2 with Y_2. This is accomplished with the recursive
calls in steps 6-8, and the set up of the y-processors in
step 3. In step 3 all the Y_1 processors set their patterns
to the index of the middle element of X_1. Similarly the
Y_2 processors set their pattern to the index of the middle
element of X_2.

There are several cases that need to be taken care of
separately. The first two cases arise when TEMP[i] < ly,
i.e., the middle element of x is smaller than all the
elements in y, or when TEMP[i] = uy, i.e., the middle
element of x is larger than all the elements in y. In the
former case we merge all the elements of X_1 to the left of
ly, and in the latter case we merge all the elements of X_2 to
the right of uy.

The last exceptional case that needs to be taken care of
is the case when either Y_1 or Y_2 is of length 1. Without
loss of generality assume that |Y_1| = 1, the element
in Y_1 is y_l and the integer in Y_1 is N[l]. Clearly,
in this case the elements of X_1 will be merged either to
the left or to the right of the element in Y_1. Thus, the
position of each element in the segment X_1 may be
determined by inserting the singleton element of y in X_1, which
may be done by calling FIND_PART_ORDER with Y_1 and X_1 in
this order. Once the location of the y element in X_1 has
been determined we can set the location of all the
elements in X_1 greater than N[l] to be l, and the
locations of all the elements in X_1 smaller than N[l] to be
l-1.

Procedure  FIND_PART_ORDER  terminates  when  the  size 
of  all  x-segments  is  zero. 

Proposition 4.1:

Algorithm FIND_PART_ORDER correctly finds the
relative location of the x-elements in the string y in O(lgN)
time.

Proof:

It is easy to see that each invocation of
FIND_PART_ORDER with segments X, Y either finds the
location of X in Y in case X is a singleton list, or
creates two recursive calls: FIND_PART_ORDER with (X_1,
Y_1) and FIND_PART_ORDER with (X_2, Y_2). The lengths
of X_1 and X_2 are smaller than |X|/2. Thus, algorithm
FIND_PART_ORDER is guaranteed to terminate. It is also
easy to verify that algorithm FIND_PART_ORDER spans a
binary tree of recursive calls to FIND_PART_ORDER. At
level k of the binary tree 2^k elements of X are merged
into Y simultaneously. At the root level the middle
element of X is compared to all the elements of Y, at the
second level two elements of X are compared to the "right"
elements of Y, and so on. Consequently, it suffices to
prove by induction that at each level the chosen elements
of X are merged into the string Y in the correct
locations. The proof using induction is similar to the proof
for Lemma 3.1 and is left as an exercise to the reader.

We must ensure that on each level, each processor is
required to compare its local element to only one element
of the other string. However, a processor may be asked to
compare more than one element at a time only if two calls
to FIND_PART_ORDER are created with the same strings, and
this may happen only if the procedure FIND_PART_ORDER is
called with a Y or X string of length one. This case is
taken care of by the special check in steps 7-8.

Thus, algorithm FIND_PART_ORDER correctly finds the
partial order of the elements of X in Y in O(lg(N)) time.
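
The recursion tree described in the proof can be sketched as a sequential simulation. The code below is an illustration only, not the procedure as specified: the name `find_part_order` is hypothetical, indices are 0-based, the comparison of the middle x-element against a whole y-segment (a single parallel step on ZMOB) is replaced by a binary search, and the special handling of y-segments of length one is omitted because a sequential simulation has no processor conflicts. Elements are assumed distinct.

```python
from bisect import bisect_left


def find_part_order(x, y):
    """Return (temp, depth) for sorted lists x and y of distinct elements.

    temp[i] is the number of elements of y smaller than x[i], which is the
    1-based index j with y_j < x_i < y_{j+1}. depth is the number of levels
    of the recursion tree, i.e. the number of parallel rounds.
    """
    temp = [0] * len(x)

    def rec(lx, ux, ly, uy, level):
        # x-segment is x[lx:ux], y-segment is y[ly:uy] (half-open ranges).
        if lx >= ux:
            return level
        mid = (lx + ux) // 2
        # On ZMOB every y-processor of the segment compares with x[mid] at once.
        pos = bisect_left(y, x[mid], ly, uy)
        temp[mid] = pos
        # X_1 is merged into Y_1 and X_2 into Y_2; both calls run in parallel.
        left = rec(lx, mid, ly, pos, level + 1)
        right = rec(mid + 1, ux, pos, uy, level + 1)
        return max(left, right)

    depth = rec(0, len(x), 0, len(y), 0)
    return temp, depth


x = [1, 4, 6, 8, 11, 13, 15]          # N = 7 = 2**3 - 1
y = [2, 3, 5, 7, 9, 10, 12]
print(find_part_order(x, y))          # ([0, 2, 3, 4, 6, 7, 7], 3)
```

For N = 2^n - 1 the tree has exactly n levels, in line with the O(lgN) bound of the proposition.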


At this point each x-processor knows its relative
location in the y-string. Obviously, we can reverse the
parameters in the call to FIND_PART_ORDER and determine
the relative location of each y-element in x. Now, it
remains to show that we can find the absolute location of
each element; this is done by procedure FIND_TOTAL_ORDER
given below.

The input to FIND_TOTAL_ORDER is two sorted strings x
and y, such that |x| = |y| = N. We assume each element in x
knows its relative location in y. We recall that by the
relative location of x_i in y we mean the j, 1 ≤ j ≤ N, such
that y_j < x_i < y_{j+1}. We also recall that each relative
location is stored in TEMP[i].

The  output  of  FIND_TOTAL_ORDER  is:  each  x-processor 
knows  its  absolute  location  in  the  string  resulting  from 
merging  x  and  y. 

Algorithm  FIND_TOTAL_ORDER  (x,v) 
begin 

1.  For  each  j,  l<j<N-l  check  if  TEMP f j 1  <  TEMP ( j  +  1] . 
Mark  all  those  PEs  whose  TEMP [ j 1  is  greater  than 
TEMP  T j - 1 1 •  We  shall  refer  to  such  a  processor  as  a 
locally  minimal  PE. 

2.  For  each  j,  2<j<N  check  if  TEMP [ j 1  >  TEMP [ j  -  1] . 


Mark  all  those  PEs  whose  TEMPfj]  is  smaller  than  TEMPfj 
+  1] .  We  shall  refer  to  such  a  processor  as  a  locally 


maximal  PE 


3.  Now  let  each  locally  maximal  x-processor  PE^  send  its 
index  INDEX [j]  by  pattern  to  the  v-orocessor  indexed  by 
TEMP [ j ]  +  1. 

At  the  end  of  this  step  each  y-processor  that  received  a 
message  from  an  x-processor  can  compute  its  absolute 
position  bv  adding  the  message  content  to  its  own  index. 
Note  that  each  y-processor  mav  receive  at  most  one  mes¬ 
sage. 

4. Now, for all j such that PE_j is not a locally
minimal PE, have PE_j do:

PATTERN[j] = TEMP[j]

5. Now, for all j, 1 ≤ j ≤ N, such that PE_j is a locally
minimal PE, have PE_j do:

Send INDEX[j] by pattern TEMP[j]

At the end of this step each PE_j may compute its relative
position among all those PEs with the same relative
location in y. This may be done by subtracting the
index of the locally minimal PE from the index of the
processor.

6. Now, for all j, 1 ≤ j ≤ N, have PE_j do:

PATTERN[j] = TEMP[j]

7. Let all the y-processors PE_k that know their absolute
location broadcast it by the pattern INDEX[k] - 1.

8. Now each x-processor may compute its absolute location
by adding the absolute location of the y-processor
received at step 7 and the relative position computed at
step 5.
end 

Notes: 

1. All the steps in FIND_TOTAL_ORDER are atomic steps.
Thus, the time complexity of FIND_TOTAL_ORDER is
constant.

2. By reversing parameters in the call to
FIND_TOTAL_ORDER  we  may  find  the  absolute  locations 
of  the  elements  of  y  in  x. 
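
The outcome of FIND_TOTAL_ORDER can be illustrated by a short sequential sketch in Python. It does not model the belt protocol; it only computes the quantities the procedure is meant to deliver. The sketch assumes 0-based indices, distinct elements, and that the relative location TEMP[j] is the number of y-elements smaller than x_j. The function names are illustrative and do not appear in the paper, and treating the first and last PE as marked is an assumption made here for the boundary cases.

```python
from bisect import bisect_left


def relative_locations(a, b):
    """TEMP for a in b: how many elements of b are smaller than each a[i]."""
    return [bisect_left(b, v) for v in a]


def local_marks(temp):
    """Mark the first (locally minimal) and last (locally maximal) PE of
    every run of equal TEMP values. Boundary PEs are marked by assumption."""
    n = len(temp)
    minimal = [j == 0 or temp[j] > temp[j - 1] for j in range(n)]
    maximal = [j == n - 1 or temp[j] < temp[j + 1] for j in range(n)]
    return minimal, maximal


def total_order(x, y):
    """Absolute (0-based) positions of every x- and y-element in the merge."""
    pos_x = [j + t for j, t in enumerate(relative_locations(x, y))]
    pos_y = [k + t for k, t in enumerate(relative_locations(y, x))]
    return pos_x, pos_y


x, y = [1, 4, 6, 9], [2, 3, 7, 8]
pos_x, pos_y = total_order(x, y)
merged = [None] * (len(x) + len(y))
for v, p in zip(x + y, pos_x + pos_y):
    merged[p] = v
```

With the sample strings above, scattering each element to its computed position reproduces the sorted merge of x and y.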

Theorem  4.1: 

Let x and y be two sorted strings of length N,
distributed in ascending order on a 2N-processor ZMOB. Then
we can merge the two strings in O(lgN) time.

Proof: The theorem follows from Proposition 4.1 and the
constant time complexity of procedure FIND_TOTAL_ORDER.

Optimal  Merging  on  ZMOB 

In Section 3 we found that one way to achieve lower
bound performance on ZMOB is by distributing the information
in such a way that each time the P processors perform
P reads, we ensure that each of the elements accessed by
PE_j already resides in the memory of PE_j. However, this
is not always possible. In this section we use one more
fundamental trick that allows deriving an optimal (up to
a constant) merging algorithm.

Definition: 

Given a parallel algorithm A, an allocation of elements
to the processors is said to be a nearly-good allocation
for A if at any time during the execution of A when
two distinct PEs perform an operation on two elements s
and r one of the following holds:

1. The elements r and s reside in two different processors.

2. If the elements reside in the same PE then s = r.

Lemma 4.2:

Let  A  be  some  CREW  PRAM  algorithm  and  let  f  be  a 
nearly-good  allocation  function  for  A.  Then  the  algorithm 
A  may  be  simulated  on  ZMOB  in  constant  time. 

Proof: 

Each time P processors want to read P locations, they
broadcast the requests for these locations. We shall
assume that each processor knows to what PE the location
it is trying to read has been allocated. Thus, at each
concurrent read step of algorithm A, P requests for
locations are broadcast on the bus.


Clearly,  if  there  is  no  contention  for  processors  the 
fetches  may  be  performed  by  having  each  PE  that  has  the 
required  location  respond  to  the  sender. 

A problem may occur in cases when two or more processors
are contending for locations in the same processor.
However, if f was a nearly-good allocation function
for A, all the requested elements that reside in the same
PE are equal. Thus, the processor that received the
request for a location may send the location out by using
its own index as a pattern. Irrespective of the number of
requests sent to a processor it will consume and respond
to one request only. If all the processors that
requested the location set their receiving pattern to the
index of the processor they requested information from,
the data will be delivered to all those processors in unit
time. Thus, algorithm A may be simulated on ZMOB in
constant time.

Q.E.D. 
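
The argument in the proof can be sketched as a small Python simulation of one concurrent-read step. This is a minimal illustration under stated assumptions, not the ZMOB hardware interface: the dictionaries `requests` and `memory` and the function name `simulate_read_step` are hypothetical, and the belt is modelled simply as a loop over responses.

```python
def simulate_read_step(requests, memory):
    """Simulate one concurrent-read step by selective broadcast.

    requests: dict, requester index -> (owner index, address)
    memory:   dict, owner index -> dict of address -> value
    Returns (received, responses): the value delivered to each requester
    and the number of responses placed on the bus.
    """
    # Each requester sets its receiving pattern to the owner's index.
    patterns = {p: owner for p, (owner, _) in requests.items()}

    # Each owner consumes one request only, however many were sent to it.
    served = {}
    for owner, addr in requests.values():
        value = memory[owner][addr]
        if owner not in served:
            served[owner] = value
        elif served[owner] != value:
            # Two different elements requested from one PE in the same step.
            raise ValueError("allocation is not nearly-good for this step")

    # Each served owner responds once, using its own index as the pattern;
    # every requester whose pattern matches picks the value up.
    received = {}
    for owner, value in served.items():
        for p, pattern in patterns.items():
            if pattern == owner:
                received[p] = value
    return received, len(served)


memory = {0: {"a": 10}, 1: {"b": 20}}
requests = {2: (0, "a"), 3: (0, "a"), 4: (1, "b")}
received, responses = simulate_read_step(requests, memory)
```

In the example, PE 2 and PE 3 contend for the same PE 0, yet only one response per owner is needed because the requested elements are equal.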

The  lemma  above  has  an  immediate  corollary. 

Corollary  4.2: 

A  P-processor  P-element  memory  CREW-PRAM  algorithm 
may  be  simulated  on  ZMOB  in  constant  time. 

Corollary 4.2 has an immediate important application
to merging. In [3] we find an algorithm for merging two
sorted N-element strings on an N-processor CREW-PRAM in
O(lglg(N)) time. The algorithm is optimal up to a constant
[8]. Therefore, by Corollary 4.2 the algorithm may
be simulated on a 2N-processor ZMOB with constant time
overhead. Kruskal's algorithm performs the same function
that procedure FIND_PART_ORDER performs in Section 4.2, that
is, for each element it finds its relative location in the
other string. Recalling that the time complexity of
FIND_TOTAL_ORDER for N = P is constant we conclude:

Theorem 4.2:

The lower bound for merging two N-element sorted
strings on a 2N-processor ZMOB is O(lglg(N)).

4.3. Merging two long strings with a small number of
processors

Let X, Y be two sorted strings, and assume |X| = |Y| = N.
In this section we show that we can merge the two strings
on a 2P-processor ZMOB in O(N/P) time using algorithm
MERGE_4.3.

The input to MERGE_4.3 is given in the form of two
strings that are initially distributed in ascending order
on 2P processors. For simplicity assume N = SP.

The output of MERGE_4.3 is: each processor knows its
absolute location in the string resulting from the merge.
Algorithm MERGE_4.3 is given below.


1. Choose the elements indexed by iS in each string.
There are no more than P chosen elements in each string.
Moreover, there is only one chosen element in each
processor. In fact the chosen element is the last element in
each PE.

2. Merge the chosen elements of the two strings. This
step may be done in O(lglgP) time as shown in Section 4.2.
Now, note that the chosen elements define P equal-length
intervals in each string, denoted by (X_1, ..., X_P) and
(Y_1, ..., Y_P). At the end of this step we know to what
interval in the other string each chosen element belongs.

In step 3 we find the exact relative position of each
chosen element of X in Y. Delegate an x-processor to each
chosen element of X. Set the pattern of this processor to
be the index of the interval in Y that the chosen element
belongs to. Formally, let PE_{iS} set its receiving pattern
to the index of the y-interval that element x_{iS} is in.

3. Broadcast the content of each y-interval on the
belt. Note that each x-processor communicates with only
one y-processor, while a y-processor may be communicating
with more than one x-processor or none at all. At the end
of this step each x-processor contains the content of the
y-interval where the chosen element of x, residing in the
x-processor, belongs. Now each x-processor may find its
relative position in the string y by performing a
sequential binary search on the content of the y-interval
it contains. Since the length of each y-interval is
bounded by N/P, and each processor is required to output or
receive only one element at a time, the time complexity of
step 3 is of order O(N/P) + lg(N/P).

4. Repeat the above step for the chosen elements of Y.

The positions of the elements chosen in step 1 and the
relative positions found in steps 3-4 define 2P segments
in each string denoted by X_j and Y_j respectively, where
1 ≤ j ≤ 2P. Each segment is defined uniquely by its right
end-point. This subdivision defines 2P disjoint pairs
(X_j, Y_j) that may be merged separately.


5. Now, we DELEGATE (see below) 2P processors to the 2P
pairs. Once the pair of x-y segments indexed by the same
integer i_j, 1 ≤ i_j ≤ 2P, reside in the chosen processor, we
may merge them sequentially in O(N/P) time. This is true
since each segment in each pair is at most N/P long.


We must, therefore, show that DELEGATING 2P processors
to the corresponding pairs of x-y segments may be
accomplished in O(N/P) time. Without loss of generality we
shall discuss only the x-segments.


We first observe that at the end of step 4 each of
the 2P processors PE_j contains an integer e_j, 0 ≤ e_j ≤ N, that
corresponds to the end of some x-segment. The x-processors
contain the ends of the x-segments defined in
step 1, while the y-processors contain the ends of the
x-segments created in step 4. Additionally, each of the 2P
processors contains some interval X_i, which is the superset
of the x-segment defined uniquely by the integer e_j.
Thus, each PE_j must determine the index i_j and the lower
bound of the segment defined by e_j. This task is performed
in O(N/P) steps by Algorithm 4.4. Once this problem is
solved we may simply DELEGATE the i_j-th segment to the
processor in which the end of that segment resides.

Algorithm 4.4

1. Note that the integers in the x-processors define a
sorted list. Also note that the integers in the
y-processors define a sorted list. Thus, we may find the
absolute order simply by merging the integers in the
x-processors with the integers in the y-processors. This
may be done in O(lglgP) time. At the end of step 1 each
processor knows the index of the segment it is responsible
for and its upper bound.

2. Let T[j] be the index of PE_j in the new enumeration.
In two steps PE_j can find the lower bound of the segment
stored in PE_j by communicating with PE_{T[j]-1}. This is done
by letting each PE_j set its pattern to T[j], and then
letting each PE_j send its T[j] by pattern to PE_{T[j]+1}.


Thus, the overall time complexity of Algorithm 4.4 is
of order lglgP.
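
The two steps of Algorithm 4.4 can be sketched sequentially in Python. This is only an illustration of what each processor ends up knowing: the sort stands in for the parallel merge of step 1, the function name `segment_bounds` and the ("x", j) / ("y", j) processor labels are hypothetical, and a segment is taken to be the half-open range of positions from its lower bound (exclusive) to its upper bound (inclusive).

```python
def segment_bounds(ends_in_x_pes, ends_in_y_pes):
    """For every processor return (T, lower, upper), where T is the index
    of its end-point in the merged enumeration (1-based), upper is the
    end-point it holds and lower is the end-point of its predecessor."""
    # Tag every end-point with the processor that holds it.
    tagged = [(e, "x", j) for j, e in enumerate(ends_in_x_pes)]
    tagged += [(e, "y", j) for j, e in enumerate(ends_in_y_pes)]
    tagged.sort()  # sequential stand-in for the merge in step 1

    bounds = {}
    lower = 0
    for t, (upper, side, j) in enumerate(tagged, start=1):
        # Step 2: the lower bound is the upper bound held by PE_{T-1}.
        bounds[(side, j)] = (t, lower, upper)
        lower = upper
    return bounds


# End-points of x-segments: three from step 1 (S = 2) and three from step 4.
bounds = segment_bounds([2, 4, 6], [1, 3, 6])
```

Ties between equal end-points are broken arbitrarily here (by label), which yields an empty segment for one of the two processors.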

Similar arguments hold for the y-processors.

Finally,  each  of  the  2P  processors  contains  a  sorted 
string  of  length  at  most  2N/P.  The  enumeration  of  all  the 
2N  elements  is  straightforward  and  is  left  as  an  exercise 
to  the  reader.  As  a  result  we  have  the  following  theorem. 

Theorem 4.3:

Let X, Y be two sorted strings, and assume
|X| = |Y| = N. For N = SP, 1 < S, we can merge the two strings
distributed in ascending order on a 2P-processor ZMOB in
O(N/P) time using 2P processors.

Note: 

Since the lower bound to merge two such strings on a
sequential computer is of order O(N), we conclude that the
algorithm presented above is optimal up to a constant.
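
The partitioning idea behind MERGE_4.3 can be checked with a short sequential Python sketch. It is not the parallel algorithm: the 2P pairs are merged one after another rather than on 2P processors, the cut points are found by binary search instead of belt communication, and the function name `merge_by_segments` is hypothetical. It assumes |X| = |Y| = N with N = SP.

```python
from bisect import bisect_right
from heapq import merge


def merge_by_segments(x, y, s):
    """Merge sorted lists x and y by cutting both at the values of the
    chosen elements (the last element of every block of length s)."""
    n = len(x)
    assert len(y) == n and n % s == 0

    # Step 1: the chosen elements of both strings, 2P values in all.
    chosen = [x[i] for i in range(s - 1, n, s)]
    chosen += [y[i] for i in range(s - 1, n, s)]
    chosen.sort()  # stands in for the merge of step 2

    out, sizes = [], []
    lo_x = lo_y = 0
    for v in chosen:
        # Steps 3-4: right end-points of the j-th x- and y-segment.
        hi_x, hi_y = bisect_right(x, v), bisect_right(y, v)
        sizes.append((hi_x - lo_x, hi_y - lo_y))
        # Step 5: each pair of segments is merged independently.
        out.extend(merge(x[lo_x:hi_x], y[lo_y:hi_y]))
        lo_x, lo_y = hi_x, hi_y
    return out, sizes


x = [1, 4, 6, 9, 12, 15]
y = [2, 3, 7, 8, 10, 20]
merged, sizes = merge_by_segments(x, y, s=2)
```

Because every block boundary of both strings is a cut point, no segment is longer than S = N/P, which is what bounds the sequential merge of each pair.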

5.  Summary 

In  this  paper  we  have  investigated  the  problem  of 
searching  and  merging  two  N-element  strings  on  a  parallel 
model of computation (ZMOB). The main results reported in
this  paper  are: 

1. Parallel searching for an element in an N-element
string distributed on P processors may be performed
on ZMOB in O(lgN/lgP) time. The algorithm given is
optimal up to a constant.

2. Two sorted strings of length N may be merged on a
2N-processor ZMOB in O(lglgN) time.

3. Two sorted strings of length N may be merged on a
2P-processor ZMOB in O(N/P) time.

The results reported in this paper may be used for
solving a variety of problems. Clearly, the lgN merging
algorithm of Section 4.1 may be used to derive a lg^2 N
sorting algorithm. In a subsequent paper we plan to extend
the results in Section 4 to derive an optimal parallel
algorithm for AND-ing (OR-ing) two binary strings
represented by run length codes. In this representation a
binary string is represented by the value of the first
element of the string followed by a string of integers
that represent the successive runs of 0s and 1s by their
respective lengths. For a large class of binary strings
this representation is more compact, and it is widely used
in the domain of signal and image processing. AND-ing and
OR-ing on a sequential computer may be performed in O(K)
time using this representation, where K is the number of
runs. We show that we can AND two strings having K runs
each in O(K/P) time on P processors. Parallel processing
of run length codes has not, to our knowledge, been
previously studied.
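
The run length representation and the sequential O(K) AND mentioned above can be sketched in Python as follows. This is a minimal sequential illustration only, not the parallel algorithm announced for the subsequent paper. A code is assumed to be a pair (first bit, list of positive run lengths), both inputs are assumed to encode strings of equal length, and the names `runs` and `rle_and` are hypothetical.

```python
def runs(code):
    """Yield (bit, length) for the successive runs of a run length code."""
    bit, lengths = code
    for n in lengths:
        yield bit, n
        bit ^= 1  # runs alternate between 0s and 1s


def rle_and(a, b):
    """AND two run length coded binary strings in time linear in the
    total number of runs, returning the result in the same encoding."""
    out = []  # list of [bit, length] with adjacent runs of differing bits
    ia, ib = runs(a), runs(b)
    bit_a, len_a = next(ia, (0, 0))
    bit_b, len_b = next(ib, (0, 0))
    while len_a and len_b:
        step = min(len_a, len_b)  # overlap of the two current runs
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1][1] += step  # extend the current output run
        else:
            out.append([bit, step])
        len_a -= step
        len_b -= step
        if len_a == 0:
            bit_a, len_a = next(ia, (0, 0))
        if len_b == 0:
            bit_b, len_b = next(ib, (0, 0))
    if not out:
        return (0, [])
    return (out[0][0], [n for _, n in out])


# 111001 AND 011110 = 011000
result = rle_and((1, [3, 2, 1]), (0, [1, 4, 1]))
```

OR-ing follows the same two-pointer pattern with `|` in place of `&`.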

Although the algorithms in this paper were described
for a particular machine organisation, they generalise
naturally to any machine that satisfies conditions 1-5 in
Section 2. Note that all the algorithms presented in
Sections 3-6 used relatively few communication links among
processors. In fact at any step of the execution the
interconnection graphs induced by the algorithms had only
P edges at the most. This is true since in all the
algorithms either the x-processors communicated to the
y-processors and vice versa or the x-y processors
communicated among themselves in a mesh-connected fashion.

In [7] it was argued that broadcasting "cannot
significantly aid sorting algorithms". In this paper we have
demonstrated that selective broadcasting allows a
dynamically reconfigurable interconnection topology, and proves
to be an extremely powerful construct for mesh-connected-like
computers. In particular it allows one to attain
maximal speed-ups for searching and merging.

Finally, we would like to conclude with a pragmatic
remark. The philosophy behind ZMOB is simple. In many
cases it is cost-effective to build a parallel machine
from a collection of slow, cheap processors connected by a
very fast communication medium. In other cases we would
like to experiment with shared memory or dynamically
reconfigurable systems. In either case ZMOB may be an
effective simulation tool. As mentioned in the
introduction, ZMOB is currently under construction at the
University of Maryland. A 16-processor ZMOB is partially
operational and is undergoing extensive testing and debugging,
and other processors are being connected to the conveyor
belt. The communication facilities described in Section 2
are fully supported on the belt, though it is premature to
conjecture whether the theoretical model introduced in
this paper is indeed realizable in practice.